The majority of the content in this resource is from the book Recommender Systems: The Textbook (Aggarwal 2016).
GOAL:
What algorithms are available?
What are their strengths/weaknesses?
Where are they applicable/not applicable?
How (at a high level) do they work?
Although I use
\[r_{ij}=\text{rating of item } j \text{ by user } i\]
as the outcome of interest throughout this document, this can be more generally understood as
\[\text{user } i \text{'s affinity for item } j\]
The exact outcome of interest will depend on the recommendation context (e.g. the outcome of interest might be ‘probability of click’ in a web recommendation context).
However, there are some secondary goals that are also very important in many cases:
Novelty: recommended items should be ones that a user has not seen before (or could not easily find on their own).
Serendipity: item recommendations should sometimes be unexpected (pleasantly surprising) to the user. Serendipity is “finding valuable or pleasant things that are not looked for.” (Kaminskas and Bridge 2016)
image source: author
Diversity: “In information retrieval... [covering] a broad area of the information space increases the chance of satisfying the user’s information need” (Kaminskas and Bridge 2016). This is because the user’s intent is somewhat ambiguous (e.g. whether “bat” refers to an animal or to sporting equipment), and returning diverse results corresponding to the query makes it more likely that the user’s desired item is within the result set. This applies equally to recommender systems: since there is some ambiguity in understanding what the user wants, it is safer not to put all of our eggs into one basket.
Coverage: “Coverage reflects the degree to which the generated recommendations cover the catalogue of available items” (Kaminskas and Bridge 2016). This is important both for users (it improves the usefulness of the system) and for business-owners (because showing users the same small subset of the item catalogue might impact stock level management of physical items, and also because there has been a societal shift in consumer demand for products in the long tail of the catalogue (Armstrong 2008)).
There is a good discussion of this topic (the multiple goals of recommender systems) and of how these outcomes can be optimized and objectively measured in the paper Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems (Kaminskas and Bridge 2016).
Some further notes:
I think that there is an important distinction to be made between search engines (which are primarily concerned with information retrieval - finding an item that you are looking for) and recommender systems (which are primarily concerned with information filtering - discovering items that you will likely be interested in). Traditionally, search engines tended to be user-agnostic, whereas recommender systems are somewhat personalized to the particular user.
To go a level deeper: the preference for relevance, novelty, serendipity, diversity and catalogue coverage of recommendations likely differs per user - some users might prefer more narrow/conservative recommendations, while others might prefer more variety.
Hyper Personalisation: Try to guess exactly which subset of items each individual user wants and show it to them.
Similar Products: Show user variants of an item that they are currently browsing/interested in (differing on 1 or more product dimensions) e.g. “same item but cheaper”, “same item but in red”, “similar item from another brand.”
Complementary Products: Recommend items which go well with/add value to/are often bought alongside the user’s current item (or one which they already own).
Popular/Trending: Highlight items popular/trending globally, within a user’s segment, or within the current context (e.g. current time, current season, user-chosen genre etc.).
Product Replenishment: Predict timing of user need for an item to be renewed/replenished/replaced.
Search Engine: Give users an intelligent system for navigating the item catalogue. See also Knowledge-Based Recommendation Systems.
TODO: need to complete this list!
The problem of predicting a particular user’s affinity for a particular item can be framed as a supervised learning problem, allowing us to use any of the plethora of performant classification, regression or ranking models available (generalized linear models, gradient boosting, random forest, deep learning etc.).
The user rating (affinity) for a particular item is modelled as a function of the user’s attributes, the item’s attributes and (possibly) information on the recommendation context:
\[\begin{array}{lcl} r_{ijk} &=& f\Big(\mathbf{x}^{(user)}_i, \mathbf{x}^{(item)}_j, \mathbf{x}^{(context)}_k\Big) \\ \mathbf{x}^{(user)}_i &=& \text{vector of user attributes} \\ \mathbf{x}^{(item)}_j &=& \text{vector of item attributes} \\ \mathbf{x}^{(context)}_k &=& \text{vector of recommendation context information} \\ \end{array}\]
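As a minimal sketch of this framing (all data below is synthetic, and ordinary least squares stands in for any of the more powerful models mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for observed (user, item, context) triples.
n_obs = 200
x_user = rng.normal(size=(n_obs, 3))   # e.g. age, tenure, activity level
x_item = rng.normal(size=(n_obs, 4))   # e.g. price, popularity, ...
x_ctx = rng.normal(size=(n_obs, 2))    # e.g. hour of day, device

X = np.hstack([x_user, x_item, x_ctx])  # one row per rating event
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + rng.normal(scale=0.1, size=n_obs)  # observed ratings r_ijk

# Any supervised learner fits here; least squares is just a placeholder.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r_pred = X @ w_hat                      # predicted affinities
```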
A strength of this method is that it can mitigate the recommendation cold start problem (in which recommendations cannot be generated for new users and/or items) by sharing information across users and items.
Some specific examples of recommender architectures based on supervised learning are:
The 2 Tower model
The Wide & Deep model
The Deep & Cross model
Another use of supervised learning in generating item recommendations is to build a separate model for every individual user using item attributes as model features: refer to Content-Based Recommendation.
| user_ID | movie | user_rating | length_hours | origin_country | budget_millions | genre | description | thrilling | exploding | adventure | exotic | drama | love |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46290 | movie A | 5 | 1.5 | USA | 50 | action | a thrilling exploding adventure | 1 | 1 | 1 | 0 | 0 | 0 |
| 46290 | movie B | 3 | 2.0 | India | 60 | drama | an exotic adventure full of drama | 0 | 0 | 1 | 1 | 1 | 0 |
| 46290 | movie C | 4 | 1.5 | UK | 100 | action | things exploding everywhere | 0 | 1 | 0 | 0 | 0 | 0 |
| 46290 | movie D | 1 | 3.0 | USA | 4 | romance | an american love story drama | 0 | 0 | 0 | 0 | 1 | 1 |
Content-based recommendation systems generate recommendations using item attributes data. For example, over time one could build an item attribute profile for a particular user (e.g. “user likes Italian and Indian food, but never orders red meat or chilli”), and use this profile to tailor their future item recommendations.
An example of a content-based model is:
\[\begin{array}{lcl} \hat{r}_{ij} &=& \text{predicted rating/affinity for item } j \text{ by user } i \\ &=& f\Big(\overset{\rightarrow{}}{\mathbf{v}}_j, \mathcal{D}_L^{(i)}\Big) \\ &=& \text{some function of the attributes of item } j \text{ and the attributes of the items that user } i \text{ has previously consumed} \\ \overset{\rightarrow{}}{\mathbf{v}}_j &=& \text{vector of attributes of item } j \\ \mathcal{D}_L^{(i)} &=& \text{the set of items previously consumed by user } i \text{ (the item attribute vectors)} \\ f() &=& \text{any chosen function (e.g. nearest neighbour model)} \\ \end{array}\]
Some examples of function \(f()\) are:
The cosine distance between \(\overset{\rightarrow{}}{\mathbf{v}}_j\) and an aggregation of the vectors in \(\mathcal{D}_L^{(i)}\) (mean/max/sum etc.)
A supervised learning model using item attributes \(\overset{\rightarrow{}}{\mathbf{v}}_j\) and \(\mathcal{D}_L^{(i)}\) as features. Note that a content-based system builds a separate model for each individual user, which is notably different from the global models (trained over all users) described in the supervised learning section above.
Example: training a regression model to predict movie ratings for user 46290 using the data in the table “Example User Item Consumption History” above
In this setting, it is especially important that the chosen model is robust to overfitting (e.g. elastic net, naive Bayes), since the data is likely to be wide (many features) and short (few examples).
Association Rules-Based Classifiers of the form
\(\{\text{item contains feature set A}\}=>\{\text{rating="like"}\}\)
Example: \(\{\text{item_material="leather", item_colour="red"}\}=>\{\text{rating=dislike}\}\)
Refer also to Association Rules-Based Recommendation
A neighbourhood-based model: the predicted rating for a new item (attribute vector \(\overset{\rightarrow{}}{\mathbf{v}}_j\)) is calculated as the aggregated rating (vote) over the closest \(k\) items to \(\overset{\rightarrow{}}{\mathbf{v}}_j\) in \(\mathcal{D}_L^{(i)}\) (using cosine similarity, euclidean distance, manhattan distance etc.)
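A small sketch of the neighbourhood-based variant (the attribute vectors and ratings below are made up):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical attribute vectors of items that user i has already rated
# (the rows of D_L^(i)), with the ratings the user gave them.
consumed = np.array([
    [1.0, 1.0, 0.0, 0.0],   # action-ish item
    [0.0, 0.0, 1.0, 1.0],   # drama-ish item
    [1.0, 0.0, 0.0, 0.0],
])
ratings = np.array([5.0, 1.0, 4.0])

def predict_rating(v_new, k=2):
    """Predict the rating of a new item as the mean rating of the k
    previously-consumed items most similar (cosine) to it."""
    sims = np.array([cosine_sim(v_new, v) for v in consumed])
    top_k = np.argsort(sims)[-k:]       # indices of k nearest items
    return float(ratings[top_k].mean())

r_hat = predict_rating(np.array([1.0, 1.0, 0.0, 0.0]), k=2)
```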
A content-based strategy is particularly effective in situations in which:
There is rich and predictive item attribute data available (structured or raw text)
Past user item consumption is predictive of their future preferences
Compared to collaborative filtering:
content-based recommendation can robustly recommend items with few or no ratings (i.e. it alleviates the cold start problem for new items)
content-based recommendation tends to produce more relevant (if also more boring and obvious) recommendations. This is because it cannot recommend items outside of the user’s historic item attribute consumption (i.e. it won’t recommend items outside of the scope of what they have consumed in the past)
If users are able to explicitly specify their own preferences (like a knowledge-based recommendation system), then this information can be incorporated into their item attribute preference profile and affect their recommendations.
TODO
image source: author
Collaborative Filtering refers, in general, to the process of inferring the unobserved preferences of entities (e.g. users) using the observed preferences of other entities (e.g. other users). It is “collaborative” in the sense that entities (unwittingly) contribute information toward each other’s recommendations.
In a recommendation context, this traditionally means inferring the missing entries in a partially observed user/item ratings matrix using the structure among the observed entries in the matrix.
It sometimes makes sense to preprocess the ratings matrix before making inferences on it e.g. mean-centring ratings within each user to mitigate individual user bias.
image source: author
Predict an unobserved rating for item \(j\) by user \(i\) as the average rating of item \(j\) across the \(k\) users most similar to user \(i\) who have rated item \(j\). This average can (if desired) be weighted proportionately to each user’s similarity to user \(i\).
Similarity between users is traditionally defined using the distance between their item rating vectors (i.e. their rows in the user/item matrix). Vector distance metrics such as cosine similarity, euclidean distance, manhattan distance etc. can be used for this.
‘Closest \(k\) neighbours’ can (if desired) be replaced by ‘all neighbours within a distance of d’.
Compared to Item-Item Similarity, User-User Similarity tends to produce more serendipitous - if sometimes less relevant - recommendations (see also Combining User-User & Item-Item Similarity).
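A minimal sketch of user-user prediction (toy ratings matrix, with `np.nan` marking unobserved entries; similarities here are computed only over co-rated items):

```python
import numpy as np

# Toy user/item ratings matrix; np.nan = unobserved.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0,    1.0],
    [1.0, 2.0, 5.0,    4.0],
    [5.0, 5.0, 1.0,    np.nan],
])

def user_user_predict(R, i, j, k=2):
    """Predict r_ij as the mean rating of item j over the k users most
    similar to user i (cosine over co-rated items)."""
    sims = []
    for u in range(R.shape[0]):
        if u == i or np.isnan(R[u, j]):
            continue                      # neighbour must have rated item j
        mask = ~np.isnan(R[i]) & ~np.isnan(R[u])
        a, b = R[i, mask], R[u, mask]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        sims.append((sim, u))
    top = sorted(sims, reverse=True)[:k]  # k most similar raters of item j
    return float(np.mean([R[u, j] for _, u in top]))

r_hat = user_user_predict(R, i=0, j=2, k=2)
```

Weighting each neighbour’s rating by its similarity (rather than a plain mean) is a one-line change.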
image source: author
Item-Item collaborative filtering is mathematically identical to User-User, except that the calculation is performed on columns of the user/item rating matrix rather than rows
i.e. an unobserved rating for item \(j\) by user \(i\) is estimated as the average rating by user \(i\) over the \(k\) most similar items to item \(j\) that user \(i\) has rated.
Compared to User-User Similarity, Item-Item Similarity tends to produce more relevant - if sometimes more boring/obvious - recommendations (see also Combining User-User & Item-Item Similarity).
Since any missing (unobserved) entry \(r_{ij}\) in the user/item ratings matrix can be estimated using either User-User Similarity (\(\hat{r}_{ij}^{(user)}\)) or Item-Item Similarity (\(\hat{r}_{ij}^{(item)}\)), a given missing entry can also be estimated as a weighted average of the two:
\[\hat{r}_{ij} \quad=\quad \alpha \space \hat{r}_{ij}^{(user)} + (1-\alpha) \space \hat{r}_{ij}^{(item)} \quad,\quad\alpha\in[0,1]\]
..where the hyperparameter \(\alpha\) can be used to control the balance between recommendation relevance and recommendation serendipity.
Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix, then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.
image source: author
For more information, refer to Matrix Factorization (Latent Factor Models)
<TODO: very short description here>
refer to Graph-Based Collaborative Filtering
Multi-armed bandit algorithms are a class of reinforcement learning models designed to behave optimally in an environment in which one must repeatedly choose one of a discrete set of available options (observing a noisy reward signal after each decision).
image source: author
The classic analogy for this is a gambling machine with multiple levers, each lever having an unknown reward distribution: the player must balance exploration (pulling levers not pulled much before) with exploitation (pulling levers historically observed to generate high reward) in order to maximise total reward over multiple lever pulls (iterations).
Multi-armed bandit algorithms are important in recommendation contexts in which the system must make decisions online: a user reacts to a provided set of item recommendations, and there is insufficient time to retrain a model incorporating this information before the user requires a new set of item recommendations (recommending them the same items again is a bad user experience).
Example: item recommendations are available from 5 different recommendation models (all trained offline) and we would like to investigate which of the 5 (or what weighting of them) is most appropriate for a particular user. We would like to respond to user feedback (e.g. click/no click on a recommended product) in real time i.e. that user’s feedback will affect what they are recommended next, without us having to wait for the models to retrain.
Some examples of popular multi-armed bandit algorithms are:
Upper Confidence Bound (UCB)
\(\epsilon\)-Greedy (and variants)
An extension, called Contextual Multi-Armed Bandits, explicitly incorporates context into the arm-selection decision. Some examples of these are:
LinUCB (L. Li et al. 2010)
Greedy Linear Bandit & Greedy-First Linear Bandit (Bastani, Bayati, and Khosravi 2017)
RegCB (Foster et al. 2018)
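An \(\epsilon\)-greedy sketch of the explore/exploit trade-off (the arm click-through rates below are made up; each arm could stand for one of the candidate recommendation models in the example above):

```python
import random

def eps_greedy(true_means, n_rounds=10000, eps=0.1, seed=0):
    """Minimal epsilon-greedy bandit: with probability eps explore a
    random arm, otherwise exploit the best empirical mean so far."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]     # running mean
    return counts, means

# e.g. 5 recommendation models ("arms") with unknown click-through rates
counts, means = eps_greedy([0.02, 0.05, 0.1, 0.3, 0.25])
```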
This is a simple heuristic method (created by Vincent Warmerdam) that was found to substantially increase the coverage of item recommendations in a commercial movie recommendation context.
It is a system for next item recommendation.
It shares some conceptual similarities with Association Rules-Based Recommendation
\[\begin{array}{lcl} R_{j\rightarrow{}i} &=& \text{relative affinity for item } i \text{ among item } j \text{ consumers compared to the rest of the user population} \\ &=& \displaystyle\frac{\text{% of item } j \text{ consumers who have also consumed item }i}{\text{% of non-item } j \text{ consumers who have consumed item }i} \\ &=& \displaystyle\frac{p(s_i\bigl|s_j)}{p(s_i\bigl|s_j^c)} \\ &\approx& \displaystyle\frac{p(s_i\bigl|s_j)}{p(s_i)} \\ \end{array}\]
For users who have consumed item \(j\), the item \(i\) with the highest score \(R_{j\rightarrow{}i}\) should be recommended to them.
For items with little user interaction, the \(R_{j\rightarrow{}i}\) score will be unstable. This could be alleviated by making the calculation Bayesian, shrinking the metric toward a chosen prior distribution.
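A sketch of the score computation (toy baskets; the add-\(\alpha\) smoothing below is a crude stand-in for the Bayesian shrinkage just mentioned):

```python
# Hypothetical consumption data: the set of items each user has consumed.
baskets = [
    {"bread", "cheese", "tomato"},
    {"bread", "cheese"},
    {"bread", "tomato"},
    {"beer"},
    {"beer", "bread", "cheese"},
    {"tomato"},
]

def lift_score(j, i, baskets, alpha=1.0):
    """R_{j->i}: P(i | j) / P(i | not j), with add-alpha smoothing to
    stabilise the ratio when item j has few consumers."""
    with_j = [b for b in baskets if j in b]
    without_j = [b for b in baskets if j not in b]
    p_i_given_j = (sum(i in b for b in with_j) + alpha) / (len(with_j) + 2 * alpha)
    p_i_given_not_j = (sum(i in b for b in without_j) + alpha) / (len(without_j) + 2 * alpha)
    return p_i_given_j / p_i_given_not_j

score = lift_score("bread", "cheese", baskets)  # > 1: cheese over-represented among bread buyers
```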
Association Rules are rules of the form
\[\overset{\text{antecedent set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{consequent set}}{\{...\}}\]
Very efficient algorithms such as the Apriori algorithm have been devised for searching data for rules of this form.
Association Rules are particularly effective in the case of unary ratings.
Here are some of the ways that Association Rules can be used to generate item recommendations:
Item-wise Recommendations:
\(\overset{\text{item set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)
Example: \(\{\text{bread, tomato}\}=>\{\text{cheese}\}\quad\) i.e. users who have bought bread and tomato should be recommended cheese.
User-wise Recommendations
\(\overset{\text{user set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{user set}}{\{...\}}\)
Example: \(\{\text{alice, bob}\}=>\{\text{john}\}\quad\) i.e. if users “Alice” and “Bob” have bought an item then “John” is also likely to like it
Profile Association Rules:
\(\overset{\text{user attribute set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)
Example: \(\{\text{male, age30-39, 2_children}\}=>\{\text{home loan}\}\quad\) i.e. a large proportion of male users in their 30s with 2 children have consumed the item “home loan”, making it a promising recommendation for this user segment
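Rules like those above are mined by thresholding their support and confidence over the transaction data. A minimal sketch of those two quantities (toy transactions; a real miner such as Apriori searches the rule space efficiently rather than evaluating one rule at a time):

```python
# Toy transactions (unary ratings: each user's set of consumed items).
transactions = [
    {"bread", "tomato", "cheese"},
    {"bread", "tomato", "cheese"},
    {"bread", "cheese"},
    {"bread", "tomato"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent:
    support(antecedent + consequent) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

conf = confidence({"bread", "tomato"}, {"cheese"}, transactions)
```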
Sparsity of the observed data in the user/item ratings matrix is problematic for neighbourhood calculation in neighbourhood-based collaborative filtering models. This problem is elegantly solved using graph representations of user/item relationships.
Relationships between users and items can be cleanly described (and modelled) using graphs, in which nodes represent entities (e.g. user or item) and edges represent relationships between them.
Graphs define a novel measure of distance (dissimilarity) between entities: the length of a path between them, travelling along edges (either the shortest path, or using a random walk).
image source: author
If required, edges can have different weights, describing the strength (and sign) of the connection (relationship) between nodes (entities).
Graphs are an effective solution to the problem of data sparsity (e.g. most users have rated only a tiny proportion of all items): even users with no items in common can be linked through intermediate users in the graph.
Graph structures can be coded directly (e.g. NetworkX), or using a model (there are MANY deep learning approaches). Model-based methods also facilitate tasks such as link (edge) prediction.
See also: Graph Neural Networks
Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.
image source: author
There are many different ways in which this matrix factorization can be performed, each with different strengths and weaknesses. These variants are defined by the constraints which they impose on the latent factors. Imposing constraints on the latent factors will always decrease accuracy (increase error) on the observed (training) data, but these constraints can also improve model generalization (reduce error on unobserved data) and increase model interpretability.
Here is a summary of factorization methods from (Aggarwal 2016) -
| Method | Constraints on factor matrices | Advantages/disadvantages |
|---|---|---|
| Unconstrained | none | Highest quality solution; good for most matrices; regularisation prevents overfitting; poor interpretability |
| Singular Value Decomposition (SVD) | orthogonal basis | Good visual interpretability; out-of-sample recommendations; good for dense matrices; poor semantic interpretability; suboptimal in sparse matrices |
| Maximum Margin | none | Highest quality solution; resists overfitting; similar to unconstrained; poor interpretability; good for discrete ratings |
| Non-Negative Matrix Factorization (NMF) | non-negativity | Good quality solution; high semantic interpretability; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |
| Probabilistic Latent Semantic Analysis (PLSA) | non-negativity | Good quality solution; high semantic interpretability; probabilistic interpretation; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |
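A minimal sketch of the unconstrained variant, fitted by stochastic gradient descent on the observed entries only (toy matrix; hyperparameters are illustrative):

```python
import numpy as np

# Toy ratings matrix with missing entries (np.nan = unobserved).
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
])

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Unconstrained matrix factorization R ~ U V^T, fitted by SGD on
    observed entries with L2 regularisation."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))   # user factors
    V = rng.normal(scale=0.1, size=(n, k))   # item factors
    obs = [(i, j) for i in range(m) for j in range(n) if not np.isnan(R[i, j])]
    for _ in range(steps):
        for i, j in obs:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

U, V = factorize(R)
R_hat = U @ V.T       # dense estimate; the missing cells are now filled in
```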
Matrix factorization models can also be combined with other recommender models within a single model architecture. Refer to:
Latent factor models can also explicitly model recommendation context (refer to Contextual Latent Factor Models)
Knowledge-based recommender systems generate recommendations by combining an explicit user query with domain knowledge/algorithmic intelligence (e.g. user attributes, item attributes, historic item consumption data, recommendation context, item availability etc.). In this way, they lie somewhere on the spectrum between a pure queryless recommender system and a search engine.
Knowledge-based recommendation is typically facilitated through an iterative/cyclic exchange between the user and an interactive user interface, for example:
\[\text{user query} >> \text{query result} >> \text{refined user query} >> \text{refined query result} >> \text{refined user query} >> \text{refined query result} >> ...\text{etc. (until user finds a satisfactory item)}\] Knowledge-based recommender systems are well-suited to recommendation contexts in which each item is unique, and not sold often (e.g. houses and cars).
User provides a set of constraints, and item recommendations are then provided from the subset of items matching the user constraints (the items within the subset can be further ranked by a relevance model). Example constraints for a house search:
“homes in East London”
“price < $500,000”
“bathrooms > 2”
After receiving the query result, the user can refine their constraint set (make the search more liberal or more conservative) and rerun the query.
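At its core, constraint-based filtering is just predicate screening over the catalogue; a toy sketch (all attribute names and values are made up, and a relevance model would normally rank the surviving items):

```python
# Hypothetical house catalogue.
houses = [
    {"id": 1, "area": "East London", "price": 450_000, "bathrooms": 3},
    {"id": 2, "area": "East London", "price": 300_000, "bathrooms": 1},
    {"id": 3, "area": "North London", "price": 480_000, "bathrooms": 3},
]

# User-specified constraints, one predicate per constraint.
constraints = [
    lambda h: h["area"] == "East London",
    lambda h: h["price"] < 500_000,
    lambda h: h["bathrooms"] > 2,
]

# Keep only items satisfying every constraint; the user can then relax
# or tighten the constraint set and rerun.
matches = [h for h in houses if all(c(h) for c in constraints)]
```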
source: https://www.webuycars.co.za
User provides a case/target/anchor item, and the recommender system then finds other items similar to it, possibly along user-specified item dimensions. Example: “please find me songs similar to ‘Come Together’ by the Beatles”.
source: https://www.chosic.com
A hybrid system is one that integrates multiple different models into a single combined architecture.
Here are some examples:
Weighted Hybrid: the final hybrid model prediction is a weighted sum of the predictions of each submodel:
\(\begin{array}{lcl} \hat{r}_{ij} &=& \text{predicted rating of (affinity to) item } j \text{ by user } i \\ &=&\displaystyle\sum_{m=1}^M \alpha_m \hat{r}_{ij}^{(m)} \quad,\quad \sum_{m=1}^M\alpha_m=1 \\ \hat{r}_{ij}^{(m)} &=& \text{model } m \text{: predicted rating of (affinity to) item } j \text{ by user } i \\ \end{array}\)
Switching Hybrid: The recommendation context dictates which model (out of a selection of models) is used for a particular case (e.g. recommendation system A is used for new users and system B for returning users)
Feature Augmentation Hybrid: TODO
Feature Combination Hybrid: TODO
Meta-Level Hybrid: TODO
Graph Neural Networks (GNNs) are a powerful tool for (elegantly) incorporating user/item interaction data (collaborative filtering), user attribute data (demographic filtering), and item attribute data (content-based filtering) within a single model.
They are particularly powerful for recommendation tasks since the relationships between users and items are often well-captured by a graph representation.
Examples of Different Graph Representations. Image source: author
GNNs can also model heterogeneous relationships (link types) between nodes (e.g. an edge between a user node and item node within the same graph could represent either a “view”, a “click” or a “purchase”).
GNNs learn a unique \(d^{(K)}\)-dimensional real-valued vector representation (embedding) of each node in a given graph. These node embeddings can then be used directly as model features (inputs) in downstream modelling tasks (e.g. edge/link prediction).
The node embeddings are learned using an encoder/decoder architecture:
An encoder model takes in a raw numeric description vector of a node as input (either simply a node ID look-up vector, or else a feature/attribute vector for the node) and outputs a node embedding for that node (a custom vector representation of the node that is optimized in order to be used for a specific downstream task).
A decoder model takes in the output of the encoder model for a given node (i.e. a node embedding vector) and outputs information which can be used to reconstruct the local structure of the original graph around that node.
The parameters of the encoder and decoder model are jointly optimized during model training. For example, the encoder and decoder might be jointly trained in order to predict whether any given pair of nodes are connected (neighbours) or not in the original graph.
Image Source: Hamilton (2022)
Here is a general description of a typical structure for the encoder model:
Each node \(u\) is initialized with a real-valued vector representation \(\mathbf{x}_u=\mathbf{h}_u^{(0)}\) (containing the node’s attribute data)
Each node \(u\)’s vector representation is then iteratively updated by combining it with an aggregation of the vector representations of its immediate neighbours (this is called message passing):
\[\begin{array}{lcl} \underset{\text{vector representation }\\\text{of node } u \text{ at iteration } \\k+1}{\underbrace{h_u^{(k+1)}}} &=& \underset{\text{some chosen}\\\text{function}\\\text{(e.g. neural net)}}{\underbrace{\text{UPDATE}}}\Big(\underset{\text{vector representation}\\\text{of node } u \text{ at}\\\text{iteration } k}{\underbrace{\mathbf{h}_u^{(k)}}}, \underset{\text{some aggregation of the vector}\\\\\text{representations (at iteration } k \text{)}\\\text{of the nodes linked to } u \\\text{ (i.e. combined information from}\\u\text{'s immediate 1-hop neighbours) }}{\underbrace{\mathbf{m}^{(k)}_{\mathcal{N}(u)}}}\Big) \\ \mathbf{m}^{(k)}_{\mathcal{N}(u)} &=& \underset{\text{some chosen}\\\text{function}\\\text{(e.g. sum)}}{\underbrace{\text{AGGREGATE}}}\bigg(\underset{\text{set of vector representations}\\\text{of all nodes neighbouring}\\\text{(directly linked to) node } u \\\text{(at iteration }k\text{)}}{\underbrace{\Big\{\mathbf{h}_v^{(k)},\forall v \in \mathcal{N}(u)\Big\}}}\bigg)\\ \end{array}\]
Two Layer Message-Passing Example. Image source: Hamilton (2022) (with minor modifications)
The choices of the \(\text{UPDATE()}\) and \(\text{AGGREGATE()}\) functions define the encoder architecture. Very many different ones have been proposed (refer to e.g. Hamilton (2022)).
Here is a basic example (from Hamilton (2022)):
\[\begin{array}{lcl} \mathbf{h}_u^{(k+1)} &=& \overset{\text{UPDATE()}}{\overbrace{\sigma\Bigg(\mathbf{W}_{self}^{(k+1)}\mathbf{h}_u^{(k)}+\mathbf{W}_{neigh}^{(k+1)}\underset{\text{AGGREGATE()}}{\Big(\underbrace{\displaystyle\sum_{v\in \mathcal{N}(u)}\mathbf{h}_v^{(k)}}\Big)} + \mathbf{b}^{(k+1)}\Bigg)}} \\ \sigma() &=& \text{element-wise non-linear function ('activation function') such as } tanh, sigmoid, ReLU \text{ etc.} \\ \mathbf{W}_{self}^{(k+1)}, \mathbf{W}_{neigh}^{(k+1)} &\in& \mathbb{R}^{d^{(k+1)}\times d^{(k)}} \space \text{are matrices of trainable parameters (weights)}\\ \mathbf{b}^{(k+1)} &\in& \mathbb{R}^{d^{(k+1)}} \text{ is a vector of trainable parameters (weights)} \\ d^{(k+1)} &=& \text{dimension of vector representation (embedding) at iteration } k+1 \\ \end{array}\]
\[\overset{\mathbf{W}^{(k+1)}}{\begin{bmatrix}\cdot&\cdot&\cdot&\cdot\\\cdot&\cdot&\cdot&\cdot\end{bmatrix}} \overset{\mathbf{h}_*^{(k)}}{\begin{bmatrix}\cdot\\\cdot\\\cdot\\\cdot\end{bmatrix}} \quad=\quad \overset{\mathbf{h}_*^{(k+1)}}{\begin{bmatrix}\cdot\\\cdot\end{bmatrix}}\]
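One round of this basic layer can be sketched in a few lines of numpy (toy 4-node graph; random, untrained weights - a real model would learn them):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: adjacency list of each node's neighbours N(u).
neighbours = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}

d_in, d_out = 3, 2
H = rng.normal(size=(4, d_in))            # h_u^(k) for every node u
W_self = rng.normal(size=(d_out, d_in))   # trainable in a real model
W_neigh = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)

def message_pass(H):
    """h_u^(k+1) = sigma(W_self h_u^(k) + W_neigh sum_{v in N(u)} h_v^(k) + b),
    with sum as AGGREGATE() and tanh as the activation."""
    H_next = np.empty((len(H), d_out))
    for u, nbrs in neighbours.items():
        m = H[nbrs].sum(axis=0)           # AGGREGATE(): sum over neighbours
        H_next[u] = np.tanh(W_self @ H[u] + W_neigh @ m + b)  # UPDATE()
    return H_next

H1 = message_pass(H)                      # embeddings after one layer
```

Stacking \(K\) such rounds lets information propagate from each node’s \(K\)-hop neighbourhood into its embedding.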
See also: Graphs can be used in order to directly model the similarity between users (or between items) for direct use in a collaborative filtering model - refer to Graph-Based Collaborative Filtering.
Factorization Machines (Rendle (2010)) are simply linear regression models with interactions, but in which the model coefficients on the interaction terms are modelled using a latent factor model. This factorization helps to avoid overfitting and improve generalization, especially when modelling sparse data. This is achieved by breaking the independence between the interaction coefficients (i.e. allowing/forcing information sharing between them).
image source: Rendle (2010)
The basic model is defined:
\[\begin{array}{lcl} \hat{y}_i(\overrightarrow{\mathbf{x}}_i) &:=& \underset{\text{standard linear regression}}{\underbrace{w_0 + \displaystyle{\sum_{i=1}^p}w_ix_i}} + \underset{\text{latent factor model of all 2-way interactions}}{\underbrace{\displaystyle{\sum_{i=1}^p\sum_{j=i+1}^p<\overrightarrow{\mathbf{v}}_i,\overrightarrow{\mathbf{v}}_j>}x_ix_j}} \\ \space &\space& w_0\in \mathbb{R}, \quad\overrightarrow{\mathbf{w}} \in \mathbb{R}^{p}, \quad \mathbf{V} \in \mathbb{R}^{p \times k} \\ \end{array}\]
The hyperparameter \(k\) (latent factor dimension) is a parameter to be tuned - it controls the flexibility of the interaction portion of the model (smaller \(k\) imposes a stronger constraint).
They are suitable for modelling extremely large (e.g. 1 billion rows) and sparse datasets (e.g. recommender engines).
The model equation can be computed in linear time \(O(kp)\) and can be optimized using gradient descent, which makes training feasible for extremely large datasets.
The model can be extended to capture higher order (e.g. 3-way) interactions (up to arbitrary order).
The model can be trivially modified in order to solve binary classification or ranking problems.
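The linear-time claim follows from Rendle’s reformulation \(\sum_{i<j}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j = \frac{1}{2}\sum_f\big[(\sum_i v_{if}x_i)^2 - \sum_i v_{if}^2x_i^2\big]\); a sketch verifying the two forms agree (random parameters, illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 3                       # number of features, latent dimension
w0 = 0.5
w = rng.normal(size=p)            # linear weights
V = rng.normal(size=(p, k))       # latent factors, one row per feature
x = rng.normal(size=p)

def fm_naive(x):
    """O(k p^2): explicit sum over all feature pairs."""
    pair = sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(p) for j in range(i + 1, p))
    return float(w0 + w @ x + pair)

def fm_fast(x):
    """O(k p): Rendle's identity for the pairwise interaction term."""
    s = V.T @ x                   # sum_i v_if x_i, per factor f
    s2 = (V ** 2).T @ (x ** 2)    # sum_i v_if^2 x_i^2, per factor f
    return float(w0 + w @ x + 0.5 * np.sum(s ** 2 - s2))

y_naive, y_fast = fm_naive(x), fm_fast(x)
```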
Factorization Machines are often included as a component/module in a more complex model (e.g. DeepFM).
In certain domains, the context (e.g. time, season, user location, user device etc.) in which the recommendation is delivered can have a material effect on the relevance of the recommendation. Obvious examples are the season (e.g. winter) in which clothing items are being recommended, or the location of the user for a restaurant recommendation.
The Multidimensional Ratings Cube (source: Aggarwal (2016))
Aggarwal (2016) describes 3 broad approaches:
Contextual Pre-Filtering: A separate recommendation model is built for each unique context (i.e. on a 2-Dimensional user/item “slice” of the ratings hypercube).
Contextual Post-Filtering: A global model (which ignores all contextual information) is built. The predictions from this model are then adjusted using contextual information. A very simple example of this is to build a recommendation model on all items, but then only allow winter clothing to be recommended in winter.
Contextual Modelling: Contextual information is explicitly used within the architecture of the recommendation model.
In contextual pre-filtering, a separate standard recommendation model is built for each unique context (i.e. on a 2-Dimensional user/item “slice” of the ratings hypercube).
For example, if the contextual dimensions were
user location
time of day
..then an independent model would be built for every unique user location/time of day combination, with each individual model trained on only the observed ratings within that user location/time of day slice of the data.
This technique increases data sparsity: relevance is achieved at the cost of increased variance/risk of overfitting. The granularity of the context segment (slice of the data) can be adjusted in order to control the balance between relevance and sparsity. For example, one could model each of the time contexts as an hourly slice:
\[\{\text{8pm, 9pm, 10pm, ...}\}\]
..or rather reduce the granularity of this context to 3-hourly slices:
\[\{\text{"1pm-3pm", "4pm-6pm", "7pm-9pm", ...}\}\]
This granularity decision can be:
Heuristic: e.g. using the finest granularity containing at least \(n\) observed ratings
Ensemble-Based: model multiple different granularities using separate models and then combine their predictions into a single prediction.
Decided using Cross-Validation: find the optimal granularity on a chosen metric using holdout data
In contextual post-filtering, a global recommendation model (which ignores all contextual information) is built. The predictions from this global model are then adjusted using contextual information.
Aggarwal (2016) describes 2 general approaches:
Heuristic: Generate predictions for all user/item pairs using the global model, and then simply screen out items which are irrelevant to the recommendation context at prediction time (e.g. show the user the highest ranked winter items during winter).
Model-based: \[\begin{array}{lcl} \hat{r}_{ijc} &=& \text{predicted rating of item } j \text{ by user } i \text{ in context } c \\ &=& \overset{\text{local model}}{\overbrace{p(i,j,c)}} \quad \times \overset{\text{global model}}{\overbrace{\quad\hat{r}_{ij}\quad}} \\ p(i,j,c) &=& \text{predicted relevance of item } j \text{ to user } i \text{ in context } c \\ \hat{r}_{ij} &=& \text{predicted rating of item } j \text{ by user } i \text{ (using global model trained without context data)} \\ \end{array}\] note that \(p(i,j,c)\) could alternatively be replaced by a model \(p(j,c)\) which does not consider the user.
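As an illustrative sketch (all data here is invented), the model-based variant with the user-independent relevance model \(p(j,c)\) reduces to an element-wise rescaling of the context-free prediction matrix:

```python
import numpy as np

def postfilter_scores(r_hat, p_relevance):
    """Model-based contextual post-filtering: scale context-free predictions
    r_hat[i, j] by a relevance model p_relevance[j] for a fixed context c
    (the user-independent variant p(j, c))."""
    # r_hat: (num_users, num_items); p_relevance: (num_items,)
    return r_hat * p_relevance  # broadcasts the item relevances over all users

# Toy example: 2 users, 3 items, context c = "winter"
r_hat = np.array([[4.0, 2.0, 5.0],
                  [3.0, 4.5, 1.0]])
p_winter = np.array([0.9, 0.1, 1.0])  # e.g. predicted winter-relevance per item
scores = postfilter_scores(r_hat, p_winter)
```

The heuristic variant corresponds to the special case where `p_relevance` contains only 0s and 1s (a hard screen rather than a soft reweighting).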
A contextual model explicitly incorporates the recommendation context data into the architecture of the model itself.
Some examples are:
Explanation here TODO!
One example is Pairwise Interaction Tensor Factorization (TODO: ref), which is defined:
(source: Aggarwal (2016))
\[\begin{array}{lcl} \hat{r}_{ijc} &=& \text{predicted rating of item } j \text{ by user } i \text{ in context } c \\ &=& (\mathbf{U}\mathbf{V}^T)_{ij} + (\mathbf{V}\mathbf{W}^T)_{jc} + (\mathbf{U}\mathbf{W}^T)_{ic} \\ &=& \displaystyle\sum_{s=1}^k \Big(u_{is}v_{js} + v_{js}w_{cs} + u_{is}w_{cs}\Big) \\ m &=& \text{number of unique users} \\ n &=& \text{number of unique items} \\ d &=& \text{number of unique contexts} \\ k &=& \text{dimension of latent space} \\ \mathbf{U} &\in& \mathbb{R}^{m\times k} \quad \text{(matrix of user factors)} \\ \mathbf{V} &\in& \mathbb{R}^{n\times k} \quad \text{(matrix of item factors)} \\ \mathbf{W} &\in& \mathbb{R}^{d\times k} \quad \text{(matrix of context factors)} \\ \end{array}\]
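The prediction rule is cheap to evaluate once the factor matrices are known. A minimal numerical sketch (the factor matrices here are random placeholders; in practice they would be learned, e.g. by stochastic gradient descent on observed ratings), verifying that the per-element sum and the full-matrix-product forms of the equation agree:

```python
import numpy as np

# Toy factor matrices standing in for learned parameters.
rng = np.random.default_rng(0)
m, n, d, k = 4, 5, 3, 2     # users, items, contexts, latent dimension
U = rng.normal(size=(m, k))  # user factors
V = rng.normal(size=(n, k))  # item factors
W = rng.normal(size=(d, k))  # context factors

def pitf_predict(i, j, c):
    """Sum of the three pairwise interactions:
    user-item + item-context + user-context."""
    return U[i] @ V[j] + V[j] @ W[c] + U[i] @ W[c]

# The same value via the full matrix products given in the text:
i, j, c = 1, 2, 0
expected = (U @ V.T)[i, j] + (V @ W.T)[j, c] + (U @ W.T)[i, c]
assert np.isclose(pitf_predict(i, j, c), expected)
```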
TODO
Shallow models (like linear models) tend to be better at memorization (finding specific recurring patterns in the data), while deep models tend to be better at generalization (approximating the structure of the underlying data-generating system). Inspired by this observation, the Wide & Deep model (Cheng et al. 2016) is an architecture containing both a wide component and a deep component - an attempt to combine the specific strengths of these 2 different model types in a unified framework.
Google evaluated this model for app recommendation on their app store Google Play.
image source: author
More specifically:
The Wide & Deep model is designed to solve “large-scale regression and classification problems with sparse inputs.”
The wide portion of the model is a Generalized Linear Model (GLM).
The deep part of the model is a Feed-Forward (Dense) Neural Network.
The parameters of both parts (wide and deep) are optimized simultaneously during model training.
Continuous input features are fed directly into the deep part of the model.
Categorical input features are either:
Embedded into a \(k\)-dimensional numeric representation then fed into the deep part of the model. These embeddings are learned directly during training. Embedding dimension \(k\) is a hyperparameter to be chosen/optimized (i.e. not learned during training).
Embedded into a \(k\)-dimensional numeric representation and fed into the deep part of the model, while (independently and in parallel) being feature-crossed and fed into the wide part of the model.
Example: categorical feature [user_occupation]…
…contributes 5-dimensional embedding \(\Big(\text{[user_occupation]=='musician'}\Big)\rightarrow{}[-0.34,0.01,5.90,-6.94,1.12]\) as input to deep part
…contributes binary input \(X_j=\Big(\text{[user_occupation]=='musician'}\Big) \in \{0,1\}\) as input to wide part
…contributes binary crossed feature \(X_j=\Big(\text{[user_occupation]=='musician'}\&\text{[user_gender]=='female'}\Big) \in \{0,1\}\) as input to wide part
…contributes binary crossed feature \(X_j=\Big(\text{[user_occupation]=='musician'}\&\text{[user_city]=='London'}\Big) \in \{0,1\}\), as input to wide part
The outputs of the deep and wide components are then linearly combined (using weights \(\mathbf{W}_{deep}\) and \(\mathbf{W}_{wide}\) learned during training) and a chosen activation function (e.g. sigmoid for binary classification) is applied in order to generate the model output/prediction. A baseline/bias term \(b\) (also learned during training) is also added into the linear combination.
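To make the combination step concrete, here is a toy numpy forward pass through a Wide & Deep architecture (this is a hand-rolled sketch for illustration, not the authors' TensorFlow implementation; all weight shapes and the single hidden layer are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wide_and_deep_forward(x_wide, x_cont, emb, W1, W2, w_wide, w_deep, b):
    """Single forward pass: the wide GLM sees the sparse binary (crossed)
    features; the deep net sees continuous features concatenated with the
    categorical embeddings; their outputs are linearly combined (plus a
    bias term b) before the sigmoid activation."""
    deep_in = np.concatenate([x_cont, emb])           # continuous + embeddings
    h = np.maximum(0.0, W1 @ deep_in)                 # one hidden ReLU layer
    deep_out = W2 @ h                                 # deep component output
    logit = w_wide @ x_wide + w_deep @ deep_out + b   # joint linear combination
    return sigmoid(logit)

# Toy dimensions: 6 binary wide features, 3 continuous features,
# a 4-dimensional embedding, 8 hidden units, 2 deep outputs.
rng = np.random.default_rng(1)
p, q, k, h_dim, o = 6, 3, 4, 8, 2
prob = wide_and_deep_forward(
    x_wide=rng.integers(0, 2, p).astype(float),
    x_cont=rng.normal(size=q),
    emb=rng.normal(size=k),
    W1=rng.normal(size=(h_dim, q + k)),
    W2=rng.normal(size=(o, h_dim)),
    w_wide=rng.normal(size=p),
    w_deep=rng.normal(size=o),
    b=0.0,
)
```

In training, all of these parameters (including the embedding vectors) would be optimized jointly, which is the point of the architecture.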
Many variants of this model exist, such as DeepFM (Guo et al. 2017), Deep & Cross Network (Wang et al. 2021), TODO…
Latent factor models can be integrated with another recommendation architecture (i.e. be included as a module within the overall architecture, where the latent factors are learned simultaneously with the other model parameters) using a simple linear combination:
\[\begin{array}{lcl} \hat{r}_{ij} &=& \underset{\text{main effects:}\\\text{user generosity &}\\\text{item popularity}}{\underbrace{\quad o_i + p_j \quad}} + \underset{\text{matrix factorization/}\\\text{latent factor model}}{\underbrace{\quad \overrightarrow{u}_i \cdot \overrightarrow{v}_j \quad}} + \beta \, f\bigg(\underset{\text{content model}}{\underbrace{\overrightarrow{c}_i^{(user)}, \overrightarrow{c}_j^{(item)}}}, \underset{\text{collaborative model}}{\underbrace{\overrightarrow{r}_i^{(user)}, \overrightarrow{r}_j^{(item)}}}, \underset{\text{context}\\\text{info.}}{\underbrace{\quad \overrightarrow{z} \quad}} \bigg)\\ && \\ \hat{r}_{ij} &=& \text{predicted rating of item } j \text{ by user } i\\ o_i &=& \text{user } i \text{ bias (e.g. user who rates all items highly)} \\ p_j &=& \text{item } j \text{ bias (e.g. item that all users rate highly)} \\ \overrightarrow{u}_i &=& \text{latent factor representation of user } i\\ \overrightarrow{v}_j &=& \text{latent factor representation of item } j \\ \overrightarrow{z} &=& \text{recommendation context information (vector)}\\ \beta &=& \text{adjustment/weight of non-latent portion of model} \\ f() &=& \text{any chosen function (e.g. linear regression)} \\ \overrightarrow{c}_i^{(user)} &=& \text{user } i \text{ attribute/content vector} \\ \overrightarrow{c}_j^{(item)} &=& \text{item } j \text{ attribute/content vector} \\ \overrightarrow{r}_i^{(user)} &=& \text{user } i \text{ observed ratings vector (row of user/item ratings matrix)} \\ \overrightarrow{r}_j^{(item)} &=& \text{item } j \text{ observed ratings vector (column of user/item ratings matrix)} \\ \end{array}\]
Some possible choices of function \(f()\) are:
A Supervised Learning model
A Neighbourhood-based Collaborative Filtering model (the content vectors are ignored)
A Hybrid system
A Content-Based model (the ratings vectors are ignored)
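The linear combination above is straightforward to express in code. This is an illustrative sketch (the function and argument names are invented, and the example `linear_f` is just one possible choice of \(f()\); in practice all parameters, including those inside \(f\), would be learned jointly):

```python
import numpy as np

def hybrid_predict(o_i, p_j, u_i, v_j, beta, f, c_user, c_item, r_user, r_item, z):
    """Combined prediction: user/item bias terms, plus the latent-factor
    dot product, plus a weighted auxiliary model f over the content,
    collaborative and context feature vectors."""
    return o_i + p_j + u_i @ v_j + beta * f(c_user, c_item, r_user, r_item, z)

def linear_f(c_user, c_item, r_user, r_item, z):
    """Example f(): a linear regression over the concatenated features
    (placeholder unit weights for illustration only)."""
    x = np.concatenate([c_user, c_item, r_user, r_item, z])
    w = np.ones_like(x)
    return float(w @ x)
```

Choosing `linear_f` to ignore the content vectors would recover a purely collaborative auxiliary model; ignoring the ratings vectors instead would recover a content-based one, matching the options listed above.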
Cascade ranking refers to the process of applying a sequence of increasingly performant models in order to repeatedly reduce the size of the candidate set of items, so as to remove obviously irrelevant items prior to the use of more resource-intensive models.
image source: X. Li et al. (2022)
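The cascade structure can be sketched in a few lines. This toy (the scorers here are arbitrary stand-ins, not real ranking models) shows how each stage shrinks the candidate set before the next, more expensive, stage runs:

```python
def cascade_rank(candidates, stages):
    """Apply a sequence of (scorer, keep_n) stages: after each stage, only
    the top keep_n candidates survive, so later (more resource-intensive)
    scorers see far fewer items than earlier (cheaper) ones."""
    for scorer, keep_n in stages:
        candidates = sorted(candidates, key=scorer, reverse=True)[:keep_n]
    return candidates

# Toy example: a cheap filter over 100 items, then a pretend expensive ranker
# over only the 10 survivors.
items = list(range(100))
cheap = lambda x: -abs(x - 50)      # stand-in for a lightweight scorer
expensive = lambda x: -abs(x - 52)  # stand-in for a heavyweight ranker
top = cascade_rank(items, [(cheap, 10), (expensive, 3)])
```

In a production system the early stages are typically approximate retrieval models (e.g. nearest-neighbour lookup over embeddings) and the final stage a full learned ranker.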